-
Notifications
You must be signed in to change notification settings - Fork 13.7k
rustdoc-search: search backend with partitioned suffix tree #144476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
3f21c90
to
a3603c7
Compare
This comment has been minimized.
This comment has been minimized.
a3603c7
to
278838f
Compare
278838f
to
96b5862
Compare
This comment has been minimized.
This comment has been minimized.
96b5862
to
799c605
Compare
This comment has been minimized.
This comment has been minimized.
799c605
to
43cb8d0
Compare
This comment has been minimized.
This comment has been minimized.
43cb8d0
to
73790db
Compare
Rollup of 19 pull requests Successful merges: - #140956 (`impl PartialEq<{str,String}> for {Path,PathBuf}`) - #141744 (Stabilize `ip_from`) - #142681 (Remove the `#[no_sanitize]` attribute in favor of `#[sanitize(xyz = "on|off")]`) - #142871 (Trivial improve doc for transpose ) - #144252 (Do not copy .rmeta files into the sysroot of the build compiler during check of rustc/std) - #144476 (rustdoc-search: search backend with partitioned suffix tree) - #144567 (Fix RISC-V Test Failures in ./x test for Multiple Codegen Cases) - #144804 (Don't warn on never to any `as` casts as unreachable) - #144960 ([RTE-513] Ignore sleep_until test on SGX) - #145013 (overhaul `&mut` suggestions in borrowck errors) - #145041 (rework GAT borrowck limitation error) - #145243 (take attr style into account in diagnostics) - #145405 (cleanup: use run_in_tmpdir in run-make/rustdoc-scrape-examples-paths) - #145432 (cg_llvm: Small cleanups to `owned_target_machine`) - #145484 (Remove `LlvmArchiveBuilder` and supporting code/bindings) - #145557 (Fix uplifting in `Assemble` step) - #145563 (Remove the `From` derive macro from prelude) - #145565 (Improve context of bootstrap errors in CI) - #145584 (interpret: avoid forcing all integer newtypes into memory during clear_provenance) Failed merges: - #145359 (Fix bug where `rustdoc-js` tester would not pick the right `search.js` file if there is more than one) - #145573 (Add an experimental unsafe(force_target_feature) attribute.) r? `@ghost` `@rustbot` modify labels: rollup
Rollup merge of #144476 - notriddle:notriddle/stringdex, r=lolbinarycat,GuillaumeGomez rustdoc-search: search backend with partitioned suffix tree Before: - https://notriddle.com/windows-docs-rs/doc-old/windows/ - https://doc.rust-lang.org/1.89.0/std/index.html - https://doc.rust-lang.org/1.89.0/nightly-rustc/rustc_hir/index.html After: - https://notriddle.com/windows-docs-rs/doc/windows/ - https://notriddle.com/rustdoc-html-demo-12/stringdex/doc/std/index.html - https://notriddle.com/rustdoc-html-demo-12/stringdex/compiler-doc/rustc_hir/index.html ## Summary Rewrites the rustdoc search engine to use an indexed data structure, factored out as a crate called [stringdex](https://crates.io/crates/stringdex), that allows it to perform modified-levenshtein distance calculations, substring matches, and prefix matches in a reasonably efficient, and, more importantly, *incremental* algorithm. ## Motivation Fixes #131156 While the windows-rs crate is definitely the worst offender, I've noticed performance problems with the compiler crates as well. It makes no sense for rustdoc-search to have poor performance: it's basically a spell checker, and those have been usable since the 90's. Stringdex is particularly designed to quickly return exact matches, to always report those matches first, and to never, ever [place new matches on top of old ones](https://web.dev/articles/cls). It also tries to yield to the event loop occasionally as it runs. This way, you can click the exactly-matched result before the rest of the search finishes running. ## Explanation A longer description of how name search works can be found in stringdex's [HACKING.md](https://gitlab.com/notriddle/stringdex/-/blob/main/HACKING.md) file. Type search is done by performing a name search on each element, then performing bitmap operations to narrow down a list of potential matches, then performing type unification on each match. ## Drawbacks It's rather complex, and takes up more disk space than the current flat list of strings. ## Rationale and alternatives Instead of a suffix tree, I could've used a different [approximate matching data structure](https://en.wikipedia.org/wiki/Approximate_string_matching). I didn't do that because I wanted to keep the current behavior (for example, a straightforward trigram index won't match [oepn](https://doc.rust-lang.org/nightly/std/?search=oepn) like the current system does). ## Prior art [Sherlodoc](https://github.com/art-w/sherlodoc) is based on a similar concept, but they: - use edit distance over a suffix tree for type-based search, instead of the binary matching that's implemented here - use substring matching for name-based search, [but not fuzzy name matching](art-w/sherlodoc#21) - actually implement body text search, which is a potential-future feature, but not implemented in this PR ## Future possibilities ### Low-level optimization in stringdex There are half a dozen low-level optimizations that I still need to explore. I haven't done them yet, because I've been working on bug fixes and rebasing on rustdoc's side, and a more solid and diverse test suite for stringdex itself. - Stringdex decides whether to bundle two nodes into the same file based on size. To figure out a node's size, I have to run compression on it. This is probably slower than it needs to be. - Stack compression is limited to the same 256-slot sliding windows as backref compression, and it doesn't have to be. (stack and backref compression are used to optimize the representation of the edge pointer from a parent node to its child; backref uses one byte, while stack is entirely implicit) - The JS-side decoder is pretty naive. It performs unnecessary hash table lookups when decoding compressed nodes, and retains a list of hashes that it doesn't need. It needs to calculate the hashes in order to construct the merkle tree correctly, but it doesn't need to keep them. - Data compression happens at the end, while emitting the node. This means it's not being counted when deciding on how to bundle, which is pretty dumb. ### Improved recall in type-driven search Right now, type-driven search performs very strict matching. It's very precise, but misses a lot of things people would want. What I'm not sure about is whether to focus more on edit-distance-based approaches, or to focus on type-theoretical approaches. Both gives avenues to improve, but edit distance is going to be faster while type checking is going to be more precise. For example, a type theoretical improvement would fix [`Iterator<T>, (T -> U) -> Iterator<U>`](https://doc.rust-lang.org/nightly/std/?search=Iterator%3CT%3E%2C%20(T%20-%3E%20U)%20-%3E%20Iterator%3CU%3E) to give [`Iterator::map`](https://doc.rust-lang.org/nightly/std/iter/trait.Iterator.html#method.map), because it would recognize that the Map struct implements the Iterator trait. I don't know of any clean way to get this result to work without implementing significant type checking logic in search.js, and an edit-distance-based "dirty" approach would likely give a bunch of other results on top of this one. ## Full-text search Once you've got this fuzzy dictionary matching to work, the logical next step is to implement some kind of information retrieval-based approach to phrase matching. Like applying edit distance to types, phrase search gets you significantly better recall, but with a few major drawbacks: - You have to pick between index bloat and the use of stopwords. Stopwords are bad because they might actually be important (try searching "if let" in mdBook if you're feeling brave), but without them, you spend a lot of space on text that doesn't matter. - Example code also tends to have a lot of irrelevant stuff in it. Like stop words, we'd have to pick potentially-confusing or bloat. Neither of these problems are deal-breakers, but they're worth keeping in mind.
Bors hasn't noticed that this was merged. @bors r- retry |
🎉 |
Be aware that there were some big |
Wow indeed. I expected regressions, but not that big. The end result is worth it, but we definitely need to improve that. |
regressed both files count https://perf.rust-lang.org/compare.html?start=b96868fa2ef174b0a5aeb3bf041b3a5b517f11f8&end=037df456409539306b50018308b8dfdc6f009d6c&stat=size%3Adoc_files_count&showRawData=true and and doc_bytes https://perf.rust-lang.org/compare.html?start=b96868fa2ef174b0a5aeb3bf041b3a5b517f11f8&end=037df456409539306b50018308b8dfdc6f009d6c&stat=size%3Adoc_bytes&showRawData=true. For many-assoc-items files count changed dramatically 52 -> 2310, size about 30% |
Seeing this merged PR at the top of the non-approved part of the queue makes me keep miscounting the number of rollup=always PRs, so let's fix that. @bors p=0 |
@bors r- |
…ts-path, r=fmease Fix JS search scripts path `rootPath` always end with a `/` so we should not add one. Interestingly enough, it only triggers the bug on a website (like https://doc.rust-lang.org/nightly/std/). Follow-up of rust-lang#144476. Fixes rust-lang#145646. cc `@notriddle` r? `@fmease`
…ts-path, r=fmease Fix JS search scripts path `rootPath` always end with a `/` so we should not add one. Interestingly enough, it only triggers the bug on a website (like https://doc.rust-lang.org/nightly/std/). Follow-up of rust-lang#144476. Fixes rust-lang#145646. cc `@notriddle` r? `@fmease`
…ts-path, r=fmease Fix JS search scripts path `rootPath` always end with a `/` so we should not add one. Interestingly enough, it only triggers the bug on a website (like https://doc.rust-lang.org/nightly/std/). Follow-up of rust-lang#144476. Fixes rust-lang#145646. cc ``@notriddle`` r? ``@fmease``
…ts-path, r=fmease Fix JS search scripts path `rootPath` always end with a `/` so we should not add one. Interestingly enough, it only triggers the bug on a website (like https://doc.rust-lang.org/nightly/std/). Follow-up of rust-lang#144476. Fixes rust-lang#145646. cc ```@notriddle``` r? ```@fmease```
…ts-path, r=fmease Fix JS search scripts path `rootPath` always end with a `/` so we should not add one. Interestingly enough, it only triggers the bug on a website (like https://doc.rust-lang.org/nightly/std/). Follow-up of rust-lang#144476. Fixes rust-lang#145646. cc ````@notriddle```` r? ````@fmease````
…ts-path, r=fmease Fix JS search scripts path `rootPath` always end with a `/` so we should not add one. Interestingly enough, it only triggers the bug on a website (like https://doc.rust-lang.org/nightly/std/). Follow-up of rust-lang#144476. Fixes rust-lang#145646. cc `````@notriddle````` r? `````@fmease`````
Rollup merge of #145650 - GuillaumeGomez:fix-js-search-scripts-path, r=fmease Fix JS search scripts path `rootPath` always end with a `/` so we should not add one. Interestingly enough, it only triggers the bug on a website (like https://doc.rust-lang.org/nightly/std/). Follow-up of #144476. Fixes #145646. cc `````@notriddle````` r? `````@fmease`````
Before:
After:
Summary
Rewrites the rustdoc search engine to use an indexed data structure, factored out as a crate called stringdex, that allows it to perform modified-levenshtein distance calculations, substring matches, and prefix matches in a reasonably efficient, and, more importantly, incremental algorithm.
Motivation
Fixes #131156
While the windows-rs crate is definitely the worst offender, I've noticed performance problems with the compiler crates as well. It makes no sense for rustdoc-search to have poor performance: it's basically a spell checker, and those have been usable since the 90's.
Stringdex is particularly designed to quickly return exact matches, to always report those matches first, and to never, ever place new matches on top of old ones. It also tries to yield to the event loop occasionally as it runs. This way, you can click the exactly-matched result before the rest of the search finishes running.
Explanation
A longer description of how name search works can be found in stringdex's HACKING.md file.
Type search is done by performing a name search on each element, then performing bitmap operations to narrow down a list of potential matches, then performing type unification on each match.
Drawbacks
It's rather complex, and takes up more disk space than the current flat list of strings.
Rationale and alternatives
Instead of a suffix tree, I could've used a different approximate matching data structure. I didn't do that because I wanted to keep the current behavior (for example, a straightforward trigram index won't match oepn like the current system does).
Prior art
Sherlodoc is based on a similar concept, but they:
Future possibilities
Low-level optimization in stringdex
There are half a dozen low-level optimizations that I still need to explore. I haven't done them yet, because I've been working on bug fixes and rebasing on rustdoc's side, and a more solid and diverse test suite for stringdex itself.
Improved recall in type-driven search
Right now, type-driven search performs very strict matching. It's very precise, but misses a lot of things people would want.
What I'm not sure about is whether to focus more on edit-distance-based approaches, or to focus on type-theoretical approaches. Both gives avenues to improve, but edit distance is going to be faster while type checking is going to be more precise.
For example, a type theoretical improvement would fix
Iterator<T>, (T -> U) -> Iterator<U>
to giveIterator::map
, because it would recognize that the Map struct implements the Iterator trait. I don't know of any clean way to get this result to work without implementing significant type checking logic in search.js, and an edit-distance-based "dirty" approach would likely give a bunch of other results on top of this one.Full-text search
Once you've got this fuzzy dictionary matching to work, the logical next step is to implement some kind of information retrieval-based approach to phrase matching.
Like applying edit distance to types, phrase search gets you significantly better recall, but with a few major drawbacks:
Neither of these problems are deal-breakers, but they're worth keeping in mind.